ABSTRACT

Recurrent neural networks (RNNs), particularly long short-term
memory (LSTM) networks, have gained much attention in automatic
speech recognition (ASR). Although some success stories have been
reported, training RNNs remains highly challenging, especially with
limited training data. Recent research found that a well-trained
model can be used as a teacher to train other child models, by using
the predictions generated by the teacher model as supervision. This
knowledge transfer learning has been employed to train simple neural
nets with a complex one, so that the final performance can reach a
level that is infeasible to obtain by regular training. In this
paper, we employ the knowledge transfer learning approach to train
RNNs (precisely, LSTMs) using a deep neural network (DNN) model as
the teacher. This differs from most existing research on knowledge
transfer learning, since the teacher (DNN) is assumed to be weaker
than the child (RNN); however, our experiments on an ASR task showed
that it works fairly well: without applying any tricks to the
learning scheme, this approach can train RNNs successfully even with
limited training data.

Index Terms: recurrent neural network, long short-term memory,
knowledge transfer learning, automatic speech recognition

1. INTRODUCTION 

Deep learning has gained significant success in a wide range of
applications, for example, automatic speech recognition (ASR) [?].
A powerful deep learning model that has been reported effective in
ASR is the recurrent neural network (RNN), e.g., [?]. An obvious
advantage of RNNs compared to conventional deep neural networks
(DNNs) is that RNNs can model long-term temporal properties and thus
are suitable for modeling speech signals.

A simple training method for RNNs is the backpropagation through
time algorithm [?]. This first-order approach, however, is rather
inefficient for two main reasons: (1) the twists of the objective
function caused by the high nonlinearity; (2) the vanishing and
explosion of gradients in backpropagation [?]. In order to address
these difficulties (mainly the second), a modified architecture
called the long short-term memory (LSTM) was proposed in [?] and has
been successfully applied to ASR [8]. In the echo state network
(ESN) architecture proposed by [9], the hidden-to-hidden weights are
not learned in the training, so the problem of odd gradients does not
exist. Recently, a special variant of the Hessian-free (HF)
optimization approach was successfully applied to learn RNNs from
random initialization [?]. A particular problem of the HF approach
is that the computation is demanding. Another recent study shows
that a carefully designed momentum setting can significantly improve
RNN training with limited computation, and can reach the performance
of the HF method [12]. Although these methods can address the
difficulties of RNN training to some extent, they are either too
tricky (e.g., the momentum method) or suboptimal (e.g., the ESN
method). Particularly with limited data, RNN training remains
difficult.

This paper focuses on the LSTM structure and presents a simple yet
powerful training algorithm based on knowledge transfer. This
algorithm is largely motivated by the recently proposed logit
matching [13] and dark knowledge distiller [14]. The basic idea of
the knowledge transfer approach is that a well-trained model involves
rich knowledge of the target task and can be used to guide the
training of other models. Current research focuses on learning
simple models (in terms of structure) from a powerful yet complex
model, or an ensemble of models [?], based on the idea of model
compression [15]. In ASR, this idea has been employed to train small
DNN models from a large and complex one [?].

In this paper, we conduct an opposite study, which employs a simple
DNN model to train a more complex RNN. Different from the existing
research that tries to distill knowledge from the teacher model, we
treat the teacher model as a regularizer so that the training process
of the child model is smoothed, or as a pre-training step so that the
supervised training starts from a good initial point. This in fact
leads to a new training approach that is easy to perform and can be
extended to any model architecture. We employ this idea to address
the difficulties in RNN training. The experiments on an ASR task
with the Aurora4 database verified that the proposed method can
significantly improve RNN training.

The rest of the paper is organized as follows. Section 2 briefly
discusses related work, and Section 3 presents the method. Section 4
presents the experiments, and the paper is concluded in Section 5.

2. RELATION TO PRIOR WORK

This study is directly motivated by the work on dark knowledge
distillation [14]. The important aspect that distinguishes our work
from others is that the existing methods focus on distilling the
knowledge of complex models and using it to improve simple models,
whereas our study uses simple models to teach complex models. The
teacher model in our work in fact knows relatively little, but it is
sufficient to provide a rough guide that is important for training
complex models, such as RNNs in the present study.

Another related work is the knowledge transfer between DNNs and
RNNs, as proposed in [?]. However, it employs knowledge transfer to
train DNNs with RNNs. This still follows the conventional idea
described above, and so is different from ours.


3. RNN TRAINING WITH KNOWLEDGE TRANSFER

3.1. Dark knowledge distiller 

The idea that a well-trained DNN model can be used as a teacher to
guide the training of other models was proposed by several authors
at almost the same time [?]. The basic assumption is that the
teacher model encodes rich knowledge for the task at hand, and this
knowledge can be distilled to boost the child model, which is often
simpler and cannot learn many details without the teacher's guidance.
There are a few ways to distill the knowledge. The logit matching
approach proposed by [13] teaches a child model by encouraging its
logits (activations before softmax) to be close to those of the
teacher model in terms of the $\ell_2$ norm, and the dark knowledge
distiller model proposed by [14] encourages the posterior
probabilities (softmax output) of the child model to be close to
those of the teacher model in terms of cross entropy. This transfer
learning has been applied to learn simple models that approach the
performance of a complex model or a large model ensemble, for
example, learning a small DNN from a large DNN [16] or a DNN from a
more complex RNN [?].

We focus on the dark knowledge distiller approach as it showed better
performance in our experiments. Basically, a well-trained DNN model
plays the role of a teacher and generates posterior probabilities of
the training samples as new targets for training other models. These
posterior probabilities are called ‘soft targets’ since the class
identities are not as deterministic as the original one-hot ‘hard
targets’. To make the targets softer, a temperature T can be applied
to scale the logits in the softmax, formulated as

$$p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

where $z_i$ denotes the logit of the $i$-th output unit and $i, j$
index the output units. The introduction of T allows more
information of non-targets to be distilled. For example, a training
sample with the hard target [1, 0, 0] does not involve any rank
information for the second and third class; with the soft targets,
e.g., [0.8, 0.15, 0.05], the rank information of the second and third
class is reflected. Additionally, with a large T applied, the target
is even softer, e.g., [0.6, 0.25, 0.15], which allows the non-target
classes to be more prominent in the training. Note that the
additional rank information on the non-target classes is not
available in the original target, but is distilled from the teacher
model. Additionally, a larger T boosts information of non-target
classes but at the same time reduces information of target classes.
If T is very large, the soft target falls back to a uniform
distribution and is not informative any more (see footnote 1).
Therefore, T controls how the knowledge is distilled from the teacher
model and hence needs to be set appropriately according to the task
at hand.
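
The effect of the temperature can be illustrated with a minimal
sketch. The function name `soften` and the teacher logits are
assumptions made for this illustration only; the logits are chosen so
that T=1 reproduces the soft target [0.8, 0.15, 0.05] used as an
example in the text.

```python
import math

def soften(logits, temperature=1.0):
    """Softmax over logits scaled by 1/T; a larger T gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits (not from the paper), chosen so that
# T=1 yields the soft target [0.8, 0.15, 0.05].
teacher_logits = [math.log(0.8), math.log(0.15), math.log(0.05)]

soft_t1 = soften(teacher_logits, temperature=1.0)
soft_t2 = soften(teacher_logits, temperature=2.0)
soft_huge = soften(teacher_logits, temperature=1e6)

print([round(p, 2) for p in soft_t1])
print([round(p, 2) for p in soft_t2])
print([round(p, 2) for p in soft_huge])
```

With these logits, T=2 gives targets near [0.59, 0.26, 0.15], and a
very large T drives all three classes towards 1/3, so the rank
information fades.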


3.2. Dark knowledge for complex model training 

Dark knowledge, in the form of soft targets, can be used not only for
boosting simple models, but also for training complex models. We
argue that training with soft targets offers at least two advantages:
(1) it provides more information for model training and (2) it makes
the training more reliable. These two advantages are particularly
important for training complex models, especially when the training
data is limited.

Firstly, soft targets offer probabilistic class labels which are not
as ‘definite’ as hard targets. On the one hand, this matches the
real situation where uncertainty always exists in classification
tasks. For example, in speech recognition, it is often difficult to
identify the phone class of a frame due to the effect of
co-articulation. On the other hand, this uncertainty involves rich
(but less discriminative) information within a single example. For
example, the uncertainty in phone classes indicates which phones are
similar to each other and easily confused. Making use of this
information in the form of soft targets (posterior probabilities)
helps improve the statistical strength of all phones in a
collaborative way, and therefore is particularly helpful for phones
with little training data.

Secondly, soft targets blur the decision boundary of classes, which
offers a smooth training. The smoothness associated with soft
targets has been noticed in [14], which states that soft targets
result in less variance in the gradient between training samples.
This can be easily verified by looking at the gradient backpropagated
to the logit layer, which is $t_i - y_i$ for the $i$-th logit, where
$t_i$ is the target and $y_i$ is the output of the child model in
training. The accumulated variance is given by:

$$\mathrm{Var}(t) = \sum_i \left\{ E_x (t_i - y_i)^2 - (E_x t_i - E_x y_i)^2 \right\}$$

where the expectation $E_x$ is conducted on the training data $x$.
If we assume that $E_x t_i$ is identical for soft and hard targets
(which is reasonable if the teacher model is well trained on the same
data), then the variance is given by:

$$\mathrm{Var}(t) = \sum_i E_x (t_i - y_i)^2 + \mathrm{const}$$

where const is a constant term. If we assume that the child model
can learn the teacher model well, the gradient variance approaches
zero with soft targets, which is impossible with hard targets even
when the training has converged.

Footnote 1: This argument should not be confused with the conclusion
in [14], where it was found that when T is also applied to the child
net, a large T is equivalent to logit matching. The assumption of
this equivalence is that T is large compared to the magnitude of the
logit values, but not infinitely large. In fact, if T is very large,
the gradient will approach zero, so no knowledge is distilled from
the teacher model.

The reduced gradient variance is highly desirable when training deep
and complex models such as RNNs. We argue that it can mitigate the
risk of gradient vanishing and explosion that is well known to hinder
RNN training, leading to a more reliable training.
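
The variance argument can be checked numerically on a toy example.
The sketch below is only an illustration under stated assumptions:
the teacher posteriors are made-up values, the hard targets are taken
as the teacher's most likely class, and the child is idealized as
having learned the teacher exactly.

```python
def accumulated_variance(targets, outputs):
    """Sum over classes i of the variance of (t_i - y_i) across samples."""
    n_samples = len(targets)
    n_classes = len(targets[0])
    total = 0.0
    for i in range(n_classes):
        grads = [targets[x][i] - outputs[x][i] for x in range(n_samples)]
        mean = sum(grads) / n_samples
        mean_sq = sum(g * g for g in grads) / n_samples
        total += mean_sq - mean * mean
    return total

# Hypothetical teacher posteriors for four frames over three classes.
teacher = [
    [0.8, 0.15, 0.05],
    [0.6, 0.30, 0.10],
    [0.2, 0.70, 0.10],
    [0.1, 0.20, 0.70],
]
# One-hot hard targets set to the teacher's top class (an assumption).
hard = [[1.0 if p == max(row) else 0.0 for p in row] for row in teacher]

# Idealized child whose outputs match the teacher exactly.
child = [row[:] for row in teacher]

var_soft = accumulated_variance(teacher, child)
var_hard = accumulated_variance(hard, child)
print(var_soft, var_hard)
```

In this idealized case the soft-target gradients are all zero, so
their variance vanishes, whereas the hard-target gradients still vary
from frame to frame.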

3.3. Regularization view 

It is known that including both soft and hard targets improves
performance, given an appropriate weight factor to balance their
relative contributions [?]. This can be formulated as a regularized
training problem, with the objective function given by:

$$\mathcal{L}(\theta) = \alpha \mathcal{L}_H(\theta) + \mathcal{L}_S(\theta) = \sum_i \sum_j (\alpha t_{ij} + p_{ij}) \ln y_{ij}(\theta)$$

where $\theta$ represents the parameters of the model,
$\mathcal{L}_H(\theta)$ and $\mathcal{L}_S(\theta)$ are the costs
associated with the hard and soft targets respectively, and $\alpha$
is the weight factor. Additionally, $t_{ij}$ and $p_{ij}$ are the
hard and soft targets for the $i$-th sample on the $j$-th class,
respectively. Note that $\mathcal{L}_H(\theta)$ is the objective
function of the conventional supervised training, and so
$\mathcal{L}_S(\theta)$ plays the role of regularization. The effect
of the regularization term is to force the model under training
(child model) to mimic the teacher model, which is a way of knowledge
transfer. In this study, a DNN model is used as the teacher model to
regularize the training of an RNN. With this regularization, the RNN
training looks for optima which produce similar targets as the DNN
does, so the risk of over-fitting and under-fitting can be largely
reduced.
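
As a minimal sketch of this objective, the code below evaluates it in
two ways: with the merged targets, and as the weighted sum of the
hard-target and soft-target terms. The function names and the toy
numbers are assumptions for illustration; no claim is made that this
mirrors the authors' implementation.

```python
import math

def log_likelihood(targets, outputs):
    """Sum over samples i and classes j of t_ij * ln y_ij."""
    return sum(
        t * math.log(y)
        for t_row, y_row in zip(targets, outputs)
        for t, y in zip(t_row, y_row)
        if t > 0.0  # zero-weight terms contribute nothing
    )

def combined_objective(hard, soft, outputs, alpha):
    """Objective evaluated with merged targets (alpha * t_ij + p_ij)."""
    merged = [
        [alpha * t + p for t, p in zip(t_row, p_row)]
        for t_row, p_row in zip(hard, soft)
    ]
    return log_likelihood(merged, outputs)

# Toy values (hypothetical): two samples, three classes.
hard = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
soft = [[0.8, 0.15, 0.05], [0.2, 0.7, 0.1]]
child = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]]
alpha = 0.5

merged_value = combined_objective(hard, soft, child, alpha)
split_value = alpha * log_likelihood(hard, child) + log_likelihood(soft, child)
print(merged_value, split_value)
```

The two values agree up to rounding, which reflects that training on
the merged targets is the same as adding the soft-target term to the
weighted hard-target term.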

3.4. Pre-training view 

Instead of training the model with soft and hard targets together,
we can first train a reasonable model with soft targets, and then
refine the model with hard targets. In this way, the transfer
learning plays the role of pre-training, and the conventional
supervised training plays the role of fine-tuning. The rationale is
that soft targets result in a reliable training, so they can be used
to conduct model initialization. However, since the information
involved in soft targets is less discriminative, refinement with hard
targets tends to be helpful. This can be informally interpreted as
first teaching the model with less discriminative but important
information, and once the model is strong enough, more discriminative
information can be learned.

This leads to a new pre-training strategy based on dark knowledge
transfer. In the conventional pre-training approaches based on
either the restricted Boltzmann machine (RBM) [18] or the
auto-encoder (AE) [19], simple models are trained and stacked to
construct complex models. The dark knowledge pre-training functions
in a different way: it makes a complex model trainable by using less
discriminative information (soft targets), while the model structure
does not change. This approach possesses several advantages: (1) it
is totally supervised and so more task-oriented; (2) it pre-trains
the model as a whole, instead of layer by layer, so tends to be fast;
(3) it can be used to pre-train any complex models for which the
layer structure is not clear, such as the RNN model that we focus on
in this paper.

The pre-training view is related to the curriculum training method
discussed in [20], where training samples that are easy to learn are
selected first to train the model, while more difficult ones are
selected later when the model has become fairly strong. In the dark
knowledge pre-training, the soft targets can be regarded as easy
samples for pre-training, and hard targets as difficult samples for
fine-tuning.

Interestingly, the regularization view and the pre-training view are
closely related. The pre-training is essentially a regularization
that places the model at some location in the parameter space where
good local minima can be easily reached. This relationship between
regularization and pre-training has been discussed in the context of
DNN training [?].

4. EXPERIMENTS 

To verify the proposed method, we use it to train RNN acoustic models
for an ASR task which is known to be difficult. Note that all the
RNNs we mention in this section are in fact LSTMs. The experiments
are conducted on the Aurora4 database in noisy conditions, and the
data profile is largely standard: 7137 utterances for model training,
4620 utterances for development and 4620 utterances for testing. The
Kaldi toolkit [22] is used to conduct the model training and
performance evaluation, and the process largely follows the Aurora4
s5 recipe for GPU-based DNN training. Specifically, the training
starts from constructing a system based on Gaussian mixture models
(GMM) with the standard 13-dimensional MFCC features plus the first
and second order derivatives. A DNN system is then trained with the
alignment provided by the GMM system. The feature used for the DNN
system is the 40-dimensional Fbanks. A symmetric 11-frame window is
applied to concatenate neighboring frames, and an LDA transform is
used to reduce the feature dimension to 200, which forms the DNN
input. The DNN architecture involves 4 hidden layers and each layer
consists of 2048 units. The output layer is composed of 2008 units,
equal to the total number of Gaussian mixtures in the GMM system.
The cross entropy is used as the training criterion, and the
stochastic gradient descent (SGD) algorithm is employed to perform
the training.
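
To make the input dimensions concrete, the sketch below splices a
symmetric 11-frame window over 40-dimensional frames, which gives
440-dimensional vectors before the LDA reduction to 200. It is an
illustration only: the function `splice` is hypothetical, repeating
the boundary frame at utterance edges is an assumption, and the LDA
step is not shown.

```python
def splice(frames, context=5):
    """Concatenate each frame with `context` neighbours on either side."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            # Clamp at utterance edges by repeating the boundary frame
            # (an assumption made for this sketch).
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced

# Toy utterance: 20 frames of 40-dimensional Fbank-like features.
utterance = [[float(t)] * 40 for t in range(20)]
spliced = splice(utterance, context=5)
print(len(spliced), len(spliced[0]))  # 20 frames, 11 * 40 = 440 dimensions
```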

In the dark knowledge transfer learning, the trained DNN model is
used as the teacher model to generate soft targets for the RNN
training. The RNN architecture involves 2 layers of LSTMs with 800
cells per layer. The unidirectional LSTM has a recurrent projection
layer as in [?], while the non-recurrent one is discarded. The input
features are the 40-dimensional Fbanks, and the output units
correspond to the Gaussian mixtures as in the DNN. The RNN is
trained with 4 streams and each stream contains 20 consecutive
frames. The momentum is empirically set to 0.9, and the starting
learning rate is set to 0.0001 by default.

The experimental results are reported in Table 1. The performance is
evaluated in terms of two criteria: the frame accuracy (FA) and the
word error rate (WER). While FA is more related to the training
criterion (cross entropy), WER is more important for speech
recognition. In Table 1, the FAs are reported on both the training
set (TR FA) and the cross validation set (CV FA), and the WER is
reported on the test set.

In Table 1, RNN-0 is the RNN baseline trained with hard targets.
RNN-T1 and RNN-T2 are trained with dark knowledge transfer, where the
temperature T is set to 1 and 2 respectively. For each dark
knowledge transfer model, the soft targets are employed in three
ways. In the ‘soft’ way, only soft targets are used in RNN training.
In the ‘reg.’ way, the soft and hard targets are used together, and
the soft targets play the role of regularization, where the gradients
of the soft targets are scaled up by $T^2$, as in [?]. In the
‘pretrain’ way, the soft targets and the hard targets are used
sequentially, and the soft targets play the role of pre-training.
The weight factor in the regularization approach is empirically set
to 0.5.
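
The three ways of using the targets can be summarized in a short
sketch. The function `targets_for_epoch` and the `switch_epoch`
parameter are hypothetical: the text does not say when the switch
from soft to hard targets happens, and the $T^2$ gradient scaling of
the ‘reg.’ way is left out here for brevity.

```python
def targets_for_epoch(mode, epoch, hard, soft, alpha=0.5, switch_epoch=5):
    """Return the target vector used for one frame at a given epoch."""
    if mode == "soft":
        # Soft targets only.
        return list(soft)
    if mode == "reg":
        # Hard and soft targets together; alpha weights the hard part.
        return [alpha * t + p for t, p in zip(hard, soft)]
    if mode == "pretrain":
        # Soft targets first (pre-training), hard targets later (fine-tuning).
        # switch_epoch is an assumed knob, not a value from the paper.
        return list(soft) if epoch < switch_epoch else list(hard)
    raise ValueError(f"unknown mode: {mode}")

hard = [1.0, 0.0, 0.0]
soft = [0.8, 0.15, 0.05]

print(targets_for_epoch("soft", 0, hard, soft))
print(targets_for_epoch("reg", 0, hard, soft))
print(targets_for_epoch("pretrain", 0, hard, soft))
print(targets_for_epoch("pretrain", 7, hard, soft))
```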



Model                Targets       TR FA%   CV FA%   WER%
----------------------------------------------------------
DNN                  Hard          63.0     45.2     11.40
RNN-0                Hard          67.3     51.9     13.57
RNN-T1 (soft)        Soft          59.4     49.9     11.46
RNN-T1 (reg.)        Soft + Hard   67.5     53.7     10.84
RNN-T1 (pretrain)    Soft, Hard    65.5     54.2     10.71
RNN-T2 (soft)        Soft          58.2     49.5     11.32
RNN-T2 (reg.)        Soft + Hard   65.8     53.3     10.88
RNN-T2 (pretrain)    Soft, Hard    64.6     54.1     10.57


Table 1: Results with Different Models and Training Methods 

It can be observed that the RNN baseline (RNN-0) cannot beat the DNN
baseline in terms of WER, although much effort has been devoted to
calibrating the training process, including various trials on
different learning rates and momentum values. This is consistent
with the results published with the Kaldi recipe. Note that this
does not mean RNNs are inferior to DNNs. From the FA results, it is
clear that the RNN model leads to better quality in terms of the
training objective. Unfortunately, this advantage is not propagated
to WER on the test set. Additionally, the results shown here cannot
be interpreted as showing that RNNs are not suitable for ASR (in
terms of WER). In fact, several researchers have reported better
WERs with RNNs, e.g., [?]. Our results just say that with the
Aurora4 database, the RNN with the basic training method does not
generalize well in terms of WER, although it works well in terms of
the training criterion.

This problem can be largely solved by the dark knowledge transfer
learning, as demonstrated by the results of the RNN-T1 and RNN-T2
systems. It can be seen that with the soft targets only, the RNN
system obtains comparable (T=1) or even better (T=2) performance in
comparison with the DNN baseline, which means that the knowledge
embedded in the DNN model has been transferred to the RNN model, and
the knowledge can be arranged in a better form within the RNN
structure. Paying attention to the FA results, it can be seen that
the knowledge transfer learning does not improve accuracy on the
training set, but leads to better or close FAs on the CV set compared
to the DNN and RNN baselines. This indicates that transfer learning
with soft targets sacrifices the FA performance on the training set a
little, but leads to better generalization on the CV set.
Additionally, the advantage on WER indicates that the generalization
is improved not only in the sense of data sets, but also in the sense
of evaluation metrics.

When combining soft and hard targets, either in the way of
regularization or pre-training, the performance in terms of both FA
and WER is improved. This confirms the hypothesis that the knowledge
transfer learning does play the roles of regularization and
pre-training. Note that in almost all these cases, the FA results on
the training set are lower than that of the RNN baseline (the
exception is RNN-T1 (reg.), at 67.5% versus 67.3%), which suggests
that the advantage of the knowledge transfer learning resides in
improving the generalizability of the resultant model. When
comparing the two dark knowledge RNN systems with different
temperatures T, we see that T=2 leads to slightly worse FAs on the
training and CV sets, but slightly better WERs in the ‘soft’ and
‘pretrain’ settings. This suggests that a higher temperature
generates a smoother direction and leads to better generalization.
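
For a sense of scale, the relative WER reductions implied by Table 1
can be worked out directly. The snippet below is a small arithmetic
aid using only figures from the table; the helper
`relative_reduction` is introduced here for illustration.

```python
# WER figures (in percent) copied from Table 1.
wer = {
    "DNN": 11.40,
    "RNN-0": 13.57,
    "RNN-T1 (pretrain)": 10.71,
    "RNN-T2 (pretrain)": 10.57,
}

def relative_reduction(baseline, system):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - system) / baseline

best = wer["RNN-T2 (pretrain)"]
vs_rnn = relative_reduction(wer["RNN-0"], best)
vs_dnn = relative_reduction(wer["DNN"], best)
print(f"{vs_rnn:.1f}% relative to RNN-0, {vs_dnn:.1f}% relative to DNN")
```

The best system (RNN-T2 with pre-training) thus reduces WER by about
22.1% relative to the RNN baseline and about 7.3% relative to the DNN
baseline.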

5. CONCLUSION 

We proposed a novel RNN training method based on dark knowledge
transfer learning. The experimental results on the ASR task
demonstrated that knowledge learned by simple models can be
effectively used to guide the training of complex models. This
knowledge can be used either as a regularization or for pre-training,
and both approaches can lead to models that are more generalizable, a
desired property for complex models. Future work involves applying
this technique to more complex models that are difficult to train
with conventional approaches, for example deep RNNs. Knowledge
transfer between heterogeneous models is under investigation as well,
e.g., between probabilistic models and neural models.